- NAME
-
- bawk - text processor
-
- SYNOPSIS
-
- bawk rules [file] ...
-
- DESCRIPTION
-
- Bawk is a text processing program that searches files for
- specific patterns and performs "actions" for every occurrence
- of these patterns. The patterns can be "regular expressions"
- as used in the UNIX "ex" editor. The actions are expressed
- using a subset of the "C" language.
-
- The patterns and actions are usually placed in a "rules" file
- whose name must be the first argument in the command line.
- All other arguments are taken to be the names of text files on
- which the rules are to be applied. The special file name "-"
- may also be used anywhere on the command line to take input
- from the standard input device.
-
- The command:
-
- bawk - prog.c - prog.h
-
- would read the pattern and action rules from the standard
- input, then apply them to the files "prog.c", the standard
- input and "prog.h" in that order.
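The way the command line is interpreted can be sketched as follows. This is an illustration in Python, not bawk's own code (bawk is written in C); the names split_arguments and describe are hypothetical.

```python
def split_arguments(argv):
    """Split a bawk-style argument list into (rules_source, input_sources).

    The first argument names the rules file and every later argument names
    a text file to process. "-" stands for the standard input in either role.
    """
    if not argv:
        raise ValueError("a rules file is required as the first argument")
    return argv[0], list(argv[1:])


def describe(name):
    """Return a readable label for one command-line argument."""
    return "<standard input>" if name == "-" else name


# The arguments of the command "bawk - prog.c - prog.h":
rules, inputs = split_arguments(["-", "prog.c", "-", "prog.h"])
print("rules from:", describe(rules))
for name in inputs:
    print("apply to:", describe(name))
```

The input sources keep their command-line order, so the standard input is processed between "prog.c" and "prog.h".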
-
- The general format of a rules file is:
-
- <pattern> { <action> }
- <pattern> { <action> }
- ...
-
- There may be any number of these <pattern> { <action> }
- sequences in the rules file. Bawk reads a line of input from
- the current input file and applies every <pattern>
- { <action> } in sequence to the line.
-
- If the <pattern> corresponding to any { <action> } is missing,
- the action is applied to every line of input. The default
- { <action> } is to print the matched input line.
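The processing model described above can be sketched as follows. This is an illustration in Python, not bawk code; the name run_rules and the representation of a rule as a (pattern, action) pair of callables are assumptions made for the sketch.

```python
def run_rules(rules, lines):
    """Apply every (pattern, action) pair, in order, to each input line.

    pattern is a callable returning a truthy value, or None when the
    pattern is missing, in which case the action runs on every line.
    action is a callable, or None for the default action, which emits
    the matched line unchanged.
    """
    output = []
    for line in lines:
        for pattern, action in rules:
            if pattern is None or pattern(line):
                if action is None:
                    output.append(line)
                else:
                    action(line, output)
    return output


rules = [
    # Pattern without an action: the default action emits the line.
    (lambda line: "TODO" in line, None),
    # Action without a pattern: runs on every line.
    (None, lambda line, out: out.append(str(len(line)))),
]
print(run_rules(rules, ["TODO fix", "done"]))  # ['TODO fix', '8', '4']
```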
-
- PATTERNS
-
- The <pattern>'s may consist of any valid C expression. If the
- <pattern> consists of two expressions separated by a comma, it
- is taken to be a range and the <action> is performed on all
- lines of input that match the range. <pattern>'s may contain
- "regular expressions" delimited by an '@' symbol. Regular
- expressions can be thought of as a generalized "wildcard"
- string matching mechanism, similar to that used by many
- operating systems to specify file names. Regular expressions
- may contain any of the following characters:
-
-     x       An ordinary character (not mentioned below)
-             matches that character.
-     '\'     The backslash quotes any character.
-             "\$" matches a dollar-sign.
-     '^'     A circumflex at the beginning of an expression
-             matches the beginning of a line.
-     '$'     A dollar-sign at the end of an expression
-             matches the end of a line.
-     '.'     A period matches any single character except
-             newline.
-     ':x'    A colon matches a class of characters
-             described by the character following it:
-               ":a"  matches any alphabetic;
-               ":d"  matches digits;
-               ":n"  matches alphanumerics;
-               ": "  matches spaces, tabs, and other control
-                     characters, such as newline.
-     '*'     An expression followed by an asterisk matches
-             zero or more occurrences of that expression:
-             "fo*" matches "f", "fo", "foo", "fooo", etc.
-     '+'     An expression followed by a plus sign matches
-             one or more occurrences of that expression:
-             "fo+" matches "fo", "foo", "fooo", etc.
-     '-'     An expression followed by a minus sign
-             optionally matches the expression.
-     '[]'    A string enclosed in square brackets matches
-             any single character in that string, but no
-             others. If the first character in the string
-             is a circumflex, the expression matches any
-             character except newline and the characters in
-             the string. For example, "[xyz]" matches "xx"
-             and "zyx", while "[^xyz]" matches "abc" but
-             not "axb". A range of characters may be
-             specified by two characters separated by "-".
-             Note that [a-z] matches lower-case alphabetics,
-             while [z-a] never matches.
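For readers more familiar with other regular expression dialects, the table above can be approximated by a translation into Python's re syntax. This is only a sketch under stated assumptions, not how bawk compiles patterns: the function names are hypothetical, the exact character set behind ": " is a guess, and a '^' or '$' that is not at the start or end of the pattern is assumed to be an ordinary character.

```python
import re

# Classes introduced by ':'; the exact set used for ": " is an assumption.
CLASSES = {"a": "[A-Za-z]", "d": "[0-9]", "n": "[A-Za-z0-9]", " ": "[\\x00-\\x20]"}
# Postfix operators; the '-' (optional match) corresponds to '?' in re.
POSTFIX = {"*": "*", "+": "+", "-": "?"}


def translate_set(pattern, i):
    """Translate the bracketed set starting at pattern[i]; return (text, next index)."""
    j = i + 1
    body = []
    if j < len(pattern) and pattern[j] == "^":
        body.append("^\\n")          # negated sets also exclude newline
        j += 1
    while j < len(pattern) and pattern[j] != "]":
        if pattern[j] == "\\" and j + 1 < len(pattern):
            j += 1                   # backslash quotes the next character
            body.append(re.escape(pattern[j]))
        elif pattern[j] == "-":
            body.append("-")         # keep '-' as the range operator
        else:
            body.append(re.escape(pattern[j]))
        j += 1
    if j >= len(pattern):
        raise ValueError("unterminated '[' in pattern")
    return "[" + "".join(body) + "]", j + 1


def translate(pattern):
    """Translate a pattern (given without its '@' delimiters) for Python's re."""
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "\\" and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))
            i += 2
        elif c == "[":
            text, i = translate_set(pattern, i)
            out.append(text)
        elif c == ":" and pattern[i + 1:i + 2] in CLASSES:
            out.append(CLASSES[pattern[i + 1]])
            i += 2
        else:
            if c == "^" and i == 0:
                out.append("^")                  # anchor only at the start
            elif c == "$" and i == len(pattern) - 1:
                out.append("$")                  # anchor only at the end
            elif c == ".":
                out.append(".")
            elif c in POSTFIX and out:
                out.append(POSTFIX[c])           # applies to the preceding item
            else:
                out.append(re.escape(c))
            i += 1
    return "".join(out)


print(translate(":a:n*"))  # [A-Za-z][A-Za-z0-9]*
```

The sketch passes a reversed range such as [z-a] through unchanged, which Python rejects as an error rather than treating it as a set that never matches.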
-
- For example, the following rules file would print every line
- that contained a valid C identifier:
-
- @[a-zA-Z_][a-zA-Z0-9_]*@
-
- And this rules file would print all lines between, and
- including, the lines that contain the words "START" and "END":
-
- @START@, @END@
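The effect of a range pattern can be sketched as a small state machine. This Python illustration is not bawk code; the name select_range is hypothetical, and the assumption that a range can begin again after it has ended is borrowed from awk rather than stated in this manual.

```python
import re


def select_range(lines, start, stop):
    """Return every line from one matching start through the next one matching stop."""
    selected = []
    inside = False
    for line in lines:
        if not inside and re.search(start, line):
            inside = True            # the first expression opens the range
        if inside:
            selected.append(line)    # both boundary lines are included
            if re.search(stop, line):
                inside = False       # the second expression closes the range
    return selected


lines = ["header", "START", "body", "END", "footer"]
print(select_range(lines, "START", "END"))  # ['START', 'body', 'END']
```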
-
- ACTIONS
-
- Actions are expressed as a subset of the C language. All
- variables are global and default to int if not formally
- declared. Variable declarations may appear anywhere within an
- action. The only types allowed are char and int, plus pointers
- to and arrays of char and int. Only decimal integer constants
- are accepted; hex (0xnn) and octal (0nn) constants are not.
- String and character constants may contain all of the special
- C escapes (\n, \r, etc.).
-
- Bawk supports the "if", "else", "while" and "break" flow of
- control constructs, which behave exactly as in C.
-
- Also supported are the following unary and binary operators,
- listed in order from highest to lowest precedence:
-
-     operator            type      associativity
-     () []               unary     left to right
-     ! ~ ++ -- - * &     unary     right to left
-     * / %               binary    left to right
-     + -                 binary    left to right
-     << >>               binary    left to right
-     < <= > >=           binary    left to right
-     == !=               binary    left to right
-     &                   binary    left to right
-     ^                   binary    left to right
-     |                   binary    left to right
-     &&                  binary    left to right
-     ||                  binary    left to right
-     =                   binary    right to left
-
- Comments are introduced by a '#' symbol and are terminated by
- the first newline character. The standard "/*" and "*/"
- comment delimiters are not supported and will result in a
- syntax error.
-
- FIELDS
-
- When bawk reads a line from the current input file, the record
- is automatically separated into "fields". A field is simply a
- string of consecutive characters delimited by either the
- beginning or end of line, or a "field separator" character.
- Initially, the field separators are the space and tab
- characters. The special unary operator '$' is used to
- reference one of the fields in the current input record
- (line). The fields are numbered sequentially starting at 1.
- The expression "$0" references the entire input line.
-
- Similarly, the "record separator" is used to determine the end
- of an input "line"; initially it is the newline character. The
- field and record separators may be changed programmatically by
- one of the actions and will remain in effect until changed
- again.
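Record and field splitting as described above can be sketched as follows. This is a Python illustration, not bawk code; the function names are hypothetical, and treating a run of consecutive separators as a single delimiter is an assumption borrowed from awk.

```python
def split_records(text, rs="\n"):
    """Split text into records on the record separator character."""
    records = text.split(rs)
    if records and records[-1] == "":
        records.pop()                # no empty record after a final separator
    return records


def split_fields(record, fs=" \t"):
    """Return a list whose index 0 is the whole record and 1..NF are the fields."""
    fields = [record]
    current = []
    for ch in record:
        if ch in fs:
            if current:              # a separator ends the current field
                fields.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:                      # the end of the line ends the last field
        fields.append("".join(current))
    return fields


fields = split_fields('120 PRINT "Name address Zip"')
print(fields[4])        # address
print(len(fields) - 1)  # 5, the value NF would hold
```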
-
- Fields behave exactly like strings and can be used in the
- same context as a character array. These "arrays" can be
- considered to have been declared as:
-
- char ($n)[ 128 ];
-
- In other words, they are 128 bytes long. Notice that the
- parentheses are necessary because the operators [] and $
- associate from right to left; without them, the statement
- would have parsed as:
-
- char $(1[ 128 ]);
-
- which is obviously ridiculous.
-
- If the contents of one of these field arrays is altered, the
- "$0" field will reflect this change. For example, this
- expression:
-
- *$4 = 'A';
-
- will change the first character of the fourth field to an
- upper-case letter 'A'. Then, when the following input line:
-
- 120 PRINT "Name address Zip"
-
- is processed, it would be printed as:
-
- 120 PRINT "Name Address Zip"
-
- Fields may also be modified with the strcpy() function (see
- below). For example, the expression:
-
- strcpy( $4, "Addr." );
-
- applied to the same line above would yield:
-
- 120 PRINT "Name Addr. Zip"
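The relationship between the field arrays and "$0" shown in these two examples can be modelled as follows. This Python sketch is a hypothetical model, not bawk's implementation: the Record class and its methods are invented for illustration, and rebuilding the line by joining the fields with a single space is an assumption that happens to reproduce the output above.

```python
class Record:
    """Hypothetical model of a record whose whole-line text follows its fields."""

    def __init__(self, line):
        self.fields = line.split()   # field n is stored at index n - 1

    def set_first_char(self, n, ch):
        """Rough analogue of the expression  *$n = ch;"""
        field = self.fields[n - 1]
        self.fields[n - 1] = ch + field[1:]

    def copy_into(self, n, text):
        """Rough analogue of the expression  strcpy( $n, text );"""
        self.fields[n - 1] = text

    @property
    def line(self):
        """Analogue of $0, rebuilt from the fields (single-space join assumed)."""
        return " ".join(self.fields)


record = Record('120 PRINT "Name address Zip"')
record.set_first_char(4, "A")
print(record.line)  # 120 PRINT "Name Address Zip"

record = Record('120 PRINT "Name address Zip"')
record.copy_into(4, "Addr.")
print(record.line)  # 120 PRINT "Name Addr. Zip"
```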
-
- PREDEFINED VARIABLES
-
- The following variables are pre-defined:
-
-     FS          Field separator (see FIELDS above).
-     RS          Record separator (see FIELDS above).
-     NF          Number of fields in current input
-                 record (line).
-     NR          Number of records processed thus far.
-     FILENAME    Name of current input file.
-     BEGIN       A special <pattern> that matches the
-                 beginning of input text, before the
-                 first record is read.
-     END         A special <pattern> that matches the
-                 end of input text, after the last
-                 record has been read.
-
- Bawk also provides some useful builtin functions for string
- manipulation and printing:
-
-     printf(arg..)   Exactly the printf() function from C.
-     getline()       Reads the next record from the current
-                     input file and returns 0 on end of
-                     file.
-     nextfile()      Closes out the current input file and
-                     begins processing the next file in the
-                     list (if any).
-     strlen(s)       Returns the length of its string
-                     argument.
-     strcpy(s,t)     Copies the string "t" to the string
-                     "s".
-     strcmp(s,t)     Compares "s" to "t" and returns 0
-                     if they match.
-     toupper(c)      Returns its character argument
-                     converted to upper-case.
-     tolower(c)      Returns its character argument
-                     converted to lower-case.
-     match(s,@re@)   Compares the string "s" to the regular
-                     expression "re" and returns the number
-                     of matches found (zero if none).
-
- EXAMPLES
-
- The following rules file will scan a C program, counting the
- number of mismatched parentheses, brackets, and braces.
-
- @[()\[\]{}]@
- {
- parens = parens + match( $0, @(@ );
- parens = parens - match( $0, @)@ );
- bracks = bracks + match( $0, @\[@ );
- bracks = bracks - match( $0, @\]@ );
- braces = braces + match( $0, @{@ );
- braces = braces - match( $0, @}@ );
- }
- END { printf("parens=%d, brackets=%d, braces=%d\n",
- parens, bracks, braces );
- }
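The arithmetic performed by the rules above can be checked against a small Python sketch. It is an illustration only, not bawk code; net_counts is a hypothetical name, and str.count stands in for match() on the assumption that match() counts every occurrence of a single-character pattern in the line.

```python
def net_counts(lines):
    """Return opening-minus-closing counts for (), [] and {} over all lines."""
    pairs = {"parens": "()", "brackets": "[]", "braces": "{}"}
    totals = {name: 0 for name in pairs}
    for line in lines:
        for name, (opener, closer) in pairs.items():
            totals[name] += line.count(opener) - line.count(closer)
    return totals


source = ["int main(void) {", "    a[0] = f(1;", "}"]
print(net_counts(source))  # {'parens': 1, 'brackets': 0, 'braces': 0}
```

A non-zero total means that kind of delimiter is unbalanced over the whole input; a positive value indicates unclosed openers, a negative value surplus closers.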
-
- This program will capitalize the first word in every sentence
- of a document:
-
- BEGIN
- {
- RS = '.'; # set record separator to a period
- }
- {
- if ( match( $1, @^[a-z]@ ) )
- *$1 = toupper( *$1 );
- printf( "%s\n", $0 );
- }
-
- LIMITATIONS
-
- Bawk was originally written in BDS C, but every attempt was
- made to keep the code as portable as possible. The program
- should be compilable with any "standard" C compiler. On CP/M
- systems compiled with BDS C, bawk takes up about 24K.
-
- An input record may be no longer than 128 characters. If
- longer records are encountered, they terminate prematurely and
- the next record starts where the previous one was hacked off.
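The effect of the record length limit can be sketched as follows. This Python illustration is not bawk code; split_long_record is a hypothetical name, and whether the limit is exactly 128 characters of text or includes a terminating byte is an assumption.

```python
def split_long_record(record, limit=128):
    """Cut an over-long record into pieces of at most limit characters.

    Each piece after the first starts where the previous one was cut off,
    so no input text is lost; it is only divided differently.
    """
    pieces = [record[i:i + limit] for i in range(0, len(record), limit)]
    return pieces or [""]            # an empty record stays a single record


pieces = split_long_record("x" * 300)
print([len(piece) for piece in pieces])  # [128, 128, 44]
```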
-
- A single pattern or action statement may be no longer than
- about 4K characters, excluding comments and whitespace. Since
- the program is semi-compiled, the tokenized version will
- probably wind up being smaller than the source code, so the 4K
- figure is only approximate.
-
- AUTHOR
-
- Bob Brodt
- 486 Linden Ave.
- Bogota, NJ 07603
-
- ACKNOWLEDGEMENTS
-
- The concept for bawk (and 3/4 of the name!) was taken from the
- program "awk" written by Alfred V. Aho, Brian W. Kernighan and
- Peter J. Weinberger. My apologies for any irreverences.
-
- The regular expression compiler/parser was borrowed from a
- program called "grep" and has been highly modified. Grep is
- distributed by the DEC Users Society (DECUS) and is Copyright
- (C) 1980 by DECUS. The author acknowledges DECUS with a nod
- of thanks for giving their general permission and okey-dokey
- to copy or modify the grep program.
-
- UNIX is a trademark of AT&T Bell Labs.
-
-